Learning to Lemmatise Slovene Words

Authors

  • Saso Dzeroski
  • Tomaz Erjavec
Abstract

Automatic lemmatisation is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma to each word in a running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, as their word forms cannot be matched against a lexicon giving the correct lemma, its part-of-speech and paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to perform lemmatisation into two subproblems: the first is to learn to perform morphosyntactic tagging, and the second is to learn to perform morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn morphosyntactic tagging, and a first-order decision list learning system is used to learn rules for morphological analysis. The dataset used is the 90,000-word Slovene translation of Orwell's '1984', split into a training set and a validation set. The validation set is the Appendix of the novel, on which extensive testing of the two components, singly and in combination, is performed. The trained model is then applied to an open-domain test set of 25,000 words, pre-annotated with their lemmas. In this set, 13,000 noun or adjective tokens are previously unseen cases. Tested on these unknown words, our method achieves an accuracy of 81% on the lemmatisation task.
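The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `analyse`, `lemmatise`, `toy_tagger` and `TOY_RULES` are hypothetical, the tagger is a lookup table standing in for the trigram tagger, the rules are hand-written rather than learned, and the tag strings only imitate a positional morphosyntactic tag format. It shows the shape of the pipeline (tag first, then apply an ordered list of tag-conditioned suffix rules where the first match wins), not the authors' system.

```python
from typing import Callable, List, Sequence, Tuple

# A rule reads: "if the token has this tag and ends in form_suffix,
# replace that suffix with lemma_suffix". All rules here are hand-written toys.
Rule = Tuple[str, str, str]  # (tag, form_suffix, lemma_suffix)

# Ordered decision list: rules are tried top to bottom, first match wins.
TOY_RULES: List[Rule] = [
    ("Ncfpi", "ami", "a"),  # illustrative: mizami -> miza
    ("Ncfsg", "e", "a"),    # illustrative: mize   -> miza
    ("Ncfsa", "o", "a"),    # illustrative: mizo   -> miza
]


def analyse(form: str, tag: str, rules: Sequence[Rule]) -> str:
    """Stage 2: produce a lemma from a word form and its tag."""
    for rule_tag, form_suffix, lemma_suffix in rules:
        if tag == rule_tag and form.endswith(form_suffix):
            stem = form[: len(form) - len(form_suffix)]
            return stem + lemma_suffix
    # Default rule at the end of the list: leave the form unchanged.
    return form


def toy_tagger(tokens: Sequence[str]) -> List[str]:
    """Stage 1 stand-in: a lookup table instead of a trained trigram tagger."""
    lookup = {"mizami": "Ncfpi", "mize": "Ncfsg", "mizo": "Ncfsa"}
    return [lookup.get(token, "X") for token in tokens]


def lemmatise(
    tokens: Sequence[str],
    tagger: Callable[[Sequence[str]], List[str]],
    rules: Sequence[Rule],
) -> List[str]:
    """Run tagging, then morphological analysis, over a token sequence."""
    tags = tagger(tokens)
    return [analyse(form, tag, rules) for form, tag in zip(tokens, tags)]


if __name__ == "__main__":
    print(lemmatise(["mize", "mizo", "mizami", "miza"], toy_tagger, TOY_RULES))
```

Because the second stage depends on the tag, a tagging error can propagate into the lemma, which is one reason to evaluate the two components both singly and in combination.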

Related resources

The IMP historical Slovene language resources

The paper describes the combined results of several projects which constitute a basic language resource infrastructure for printed historical Slovene. The IMP language resources consist of a digital library, an annotated corpus and a lexicon, which are interlinked and uniformly encoded following the Text Encoding Initiative Guidelines. The library holds about 650 units (mostly complete books) c...

Machine Learning of Morphosyntactic Structure: Lemmatizing Unknown Slovene Words

Automatic lemmatization is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma (base form) to each word in a running text is not trivial, since for instance, nouns inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, sin...

A Rule based Approach to Word Lemmatization

Lemmatization is the process of finding the normalized form of a word. It is the same as looking for a transformation to apply on a word to get its normalized form. The approach presented in this paper focuses on word endings: what word suffix should be removed and/or added to get the normalized form. This paper compares the results of two word lemmatization algorithms, one based on if-then rul...

Learning to lemmatise Polish noun phrases

We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while ...

Analysis of Automatic Stress Assignment in Slovene

We tested the ability of humans and machines (data mining techniques) to assign stress to Slovene words. This is a challenging comparison for machines since humans accomplish the task outstandingly even on unknown words without any context. The goal of finding good machine-made models for stress assignment was set by applying new methods and by making use of a known theory about rules for stres...

Publication date: 1999